Problem Definition During development, motor neurons change rapidly as embryonic time progresses. At the molecular level, those changes should be reflected in each cellās gene expression profile. Our goal is to test whether embryonic age is actually detectable as a structure in high-dimensional single-cell RNA-seq data from motor neurons.
The input to our system is a set of preprocessed .h5ad files, one per embryonic age (e09, e10, e12, e13, e14). Each file contains:
A gene expression matrix (cells Ć genes) for motor neuronāenriched samples at that age Cell-level metadata including the embryonic age We combine these files into a single AnnData object where each row is a cell and each column is a gene, and we add an embryonic_age label to each cell. The core question is:
Do cells from different embryonic ages occupy distinct regions of gene expression space, or are they intermixed?
To answer this, our system performs a full exploratory and statistical analysis:
Quality and composition check. We verify that the combined dataset is balanced enough across ages and inspect basic properties like the number of cells per age. Dimensionality reduction. We apply Principal Component Analysis (PCA) followed by UMAP to compress thousands of genes into a 2D embedding. We color points by embryonic age to see if ages form clusters or gradients. Quantitative separation metrics. Beyond eyeballing plots, we compute a silhouette score that measures how well embryonic ages are separated in the UMAP space. Differential gene expression (DGE). We identify genes that are significantly up- or down-regulated at each embryonic age compared to the others. These āmarkerā genes highlight which molecular programs distinguish early from late stages. Marker visualization. We visualize top age-specific genes using heatmaps, dotplots, and UMAP overlays to see how their expression patterns track the spatial structure of the embedding. The outputs of the system are therefore:
PCA and UMAP plots colored by embryonic age A numerical silhouette score measuring age separation Ranked lists of age-specific marker genes from DGE Marker gene heatmaps, dotplots, and UMAP expression overlays If embryonic age is truly encoded in the gene expression profiles, we expect to see (i) clear age-structured patterns in PCA/UMAP space and (ii) coherent sets of genes whose expression changes systematically with age. If we instead find no meaningful clustering or only weak DGE, that would suggest embryonic age is not strongly separable in this dataset or that other factors (batch, technical noise, subtypes) dominate the structure.
!pip install scanpy anndata scipy
Requirement already satisfied: scanpy in /opt/miniconda3/envs/project_env/lib/python3.11/site-packages (1.11.5) Requirement already satisfied: anndata in /opt/miniconda3/envs/project_env/lib/python3.11/site-packages (0.12.6) Requirement already satisfied: scipy in /opt/miniconda3/envs/project_env/lib/python3.11/site-packages (1.16.3) Requirement already satisfied: h5py>=3.7.0 in /opt/miniconda3/envs/project_env/lib/python3.11/site-packages (from scanpy) (3.15.1) Requirement already satisfied: joblib in /opt/miniconda3/envs/project_env/lib/python3.11/site-packages (from scanpy) (1.5.2) Requirement already satisfied: legacy-api-wrap>=1.4.1 in /opt/miniconda3/envs/project_env/lib/python3.11/site-packages (from scanpy) (1.5) Requirement already satisfied: matplotlib>=3.7.5 in /opt/miniconda3/envs/project_env/lib/python3.11/site-packages (from scanpy) (3.10.7) Requirement already satisfied: natsort in /opt/miniconda3/envs/project_env/lib/python3.11/site-packages (from scanpy) (8.4.0) Requirement already satisfied: networkx>=2.7.1 in /opt/miniconda3/envs/project_env/lib/python3.11/site-packages (from scanpy) (3.6) Requirement already satisfied: numba!=0.62.0rc1,>=0.57.1 in /opt/miniconda3/envs/project_env/lib/python3.11/site-packages (from scanpy) (0.63.0) Requirement already satisfied: numpy>=1.24.1 in /opt/miniconda3/envs/project_env/lib/python3.11/site-packages (from scanpy) (2.3.5) Requirement already satisfied: packaging>=21.3 in /opt/miniconda3/envs/project_env/lib/python3.11/site-packages (from scanpy) (25.0) Requirement already satisfied: pandas>=1.5.3 in /opt/miniconda3/envs/project_env/lib/python3.11/site-packages (from scanpy) (2.3.3) Requirement already satisfied: patsy!=1.0.0 in /opt/miniconda3/envs/project_env/lib/python3.11/site-packages (from scanpy) (1.0.2) Requirement already satisfied: pynndescent>=0.5.13 in /opt/miniconda3/envs/project_env/lib/python3.11/site-packages (from scanpy) (0.5.13) Requirement already satisfied: scikit-learn>=1.1.3 in /opt/miniconda3/envs/project_env/lib/python3.11/site-packages (from scanpy) (1.7.2) Requirement already satisfied: seaborn>=0.13.2 in /opt/miniconda3/envs/project_env/lib/python3.11/site-packages (from scanpy) (0.13.2) Requirement already satisfied: session-info2 in /opt/miniconda3/envs/project_env/lib/python3.11/site-packages (from scanpy) (0.2.3) Requirement already satisfied: statsmodels>=0.14.5 in /opt/miniconda3/envs/project_env/lib/python3.11/site-packages (from scanpy) (0.14.6) Requirement already satisfied: tqdm in /opt/miniconda3/envs/project_env/lib/python3.11/site-packages (from scanpy) (4.67.1) Requirement already satisfied: typing-extensions in /opt/miniconda3/envs/project_env/lib/python3.11/site-packages (from scanpy) (4.15.0) Requirement already satisfied: umap-learn>=0.5.6 in /opt/miniconda3/envs/project_env/lib/python3.11/site-packages (from scanpy) (0.5.9.post2) Requirement already satisfied: array-api-compat>=1.7.1 in /opt/miniconda3/envs/project_env/lib/python3.11/site-packages (from anndata) (1.12.0) Requirement already satisfied: zarr!=3.0.*,>=2.18.7 in /opt/miniconda3/envs/project_env/lib/python3.11/site-packages (from anndata) (3.1.5) Requirement already satisfied: contourpy>=1.0.1 in /opt/miniconda3/envs/project_env/lib/python3.11/site-packages (from matplotlib>=3.7.5->scanpy) (1.3.3) Requirement already satisfied: cycler>=0.10 in /opt/miniconda3/envs/project_env/lib/python3.11/site-packages (from matplotlib>=3.7.5->scanpy) (0.12.1) Requirement already satisfied: fonttools>=4.22.0 in /opt/miniconda3/envs/project_env/lib/python3.11/site-packages (from matplotlib>=3.7.5->scanpy) (4.61.0) Requirement already satisfied: kiwisolver>=1.3.1 in /opt/miniconda3/envs/project_env/lib/python3.11/site-packages (from matplotlib>=3.7.5->scanpy) (1.4.9) Requirement already satisfied: pillow>=8 in /opt/miniconda3/envs/project_env/lib/python3.11/site-packages (from matplotlib>=3.7.5->scanpy) (12.0.0) Requirement already satisfied: pyparsing>=3 in /opt/miniconda3/envs/project_env/lib/python3.11/site-packages (from matplotlib>=3.7.5->scanpy) (3.2.5) Requirement already satisfied: python-dateutil>=2.7 in /opt/miniconda3/envs/project_env/lib/python3.11/site-packages (from matplotlib>=3.7.5->scanpy) (2.9.0.post0) Requirement already satisfied: llvmlite<0.47,>=0.46.0dev0 in /opt/miniconda3/envs/project_env/lib/python3.11/site-packages (from numba!=0.62.0rc1,>=0.57.1->scanpy) (0.46.0) Requirement already satisfied: pytz>=2020.1 in /opt/miniconda3/envs/project_env/lib/python3.11/site-packages (from pandas>=1.5.3->scanpy) (2025.2) Requirement already satisfied: tzdata>=2022.7 in /opt/miniconda3/envs/project_env/lib/python3.11/site-packages (from pandas>=1.5.3->scanpy) (2025.2) Requirement already satisfied: six>=1.5 in /opt/miniconda3/envs/project_env/lib/python3.11/site-packages (from python-dateutil>=2.7->matplotlib>=3.7.5->scanpy) (1.17.0) Requirement already satisfied: threadpoolctl>=3.1.0 in /opt/miniconda3/envs/project_env/lib/python3.11/site-packages (from scikit-learn>=1.1.3->scanpy) (3.6.0) Requirement already satisfied: donfig>=0.8 in /opt/miniconda3/envs/project_env/lib/python3.11/site-packages (from zarr!=3.0.*,>=2.18.7->anndata) (0.8.1.post1) Requirement already satisfied: google-crc32c>=1.5 in /opt/miniconda3/envs/project_env/lib/python3.11/site-packages (from zarr!=3.0.*,>=2.18.7->anndata) (1.7.1) Requirement already satisfied: numcodecs>=0.14 in /opt/miniconda3/envs/project_env/lib/python3.11/site-packages (from zarr!=3.0.*,>=2.18.7->anndata) (0.16.5) Requirement already satisfied: pyyaml in /opt/miniconda3/envs/project_env/lib/python3.11/site-packages (from donfig>=0.8->zarr!=3.0.*,>=2.18.7->anndata) (6.0.3)
import os
import glob
import scanpy as sc
import anndata as ad
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
data_dir = "rna_data"
file_paths = sorted(glob.glob(os.path.join(data_dir, "*_integrated.h5ad")))
adatas = []
for path in file_paths:
print(path)
age_label = os.path.basename(path).split("_")[0]
tmp = sc.read_h5ad(path)
tmp.obs["embryonic_age"] = age_label
adatas.append(tmp)
adata = ad.concat(adatas, join="outer")
adata.obs_names_make_unique()
adata.obs["embryonic_age"] = adata.obs["embryonic_age"].astype("category")
adata
rna_data/e09_integrated.h5ad rna_data/e10_integrated.h5ad rna_data/e11_integrated.h5ad rna_data/e12_integrated.h5ad rna_data/e13_integrated.h5ad rna_data/e14_integrated.h5ad
AnnData object with n_obs Ć n_vars = 99989 Ć 29184
obs: 'age', '10X', 'sample_type', 'Sequencer', 'Flowcell', 'Collection', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_20_genes', 'total_counts_mt', 'log1p_total_counts_mt', 'pct_counts_mt', 'n_genes', 'batch', 'leiden', 'doublets', 'predicted_labels', 'majority_voting', 'conf_score', 'celltypes', 'over_clustering', 'batch_id', 'embryonic_age'
obsm: 'X_pca', 'X_scanorama', 'X_umap'
layers: 'counts'
print(adata)
age_counts = adata.obs["embryonic_age"].value_counts().sort_index()
print(age_counts)
plt.figure(figsize=(5, 4))
age_counts.plot(kind="bar")
plt.xlabel("Embryonic age")
plt.ylabel("Number of cells")
plt.title("Cells per embryonic age")
plt.tight_layout()
plt.show()
AnnData object with n_obs Ć n_vars = 99989 Ć 29184
obs: 'age', '10X', 'sample_type', 'Sequencer', 'Flowcell', 'Collection', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_20_genes', 'total_counts_mt', 'log1p_total_counts_mt', 'pct_counts_mt', 'n_genes', 'batch', 'leiden', 'doublets', 'predicted_labels', 'majority_voting', 'conf_score', 'celltypes', 'over_clustering', 'batch_id', 'embryonic_age'
obsm: 'X_pca', 'X_scanorama', 'X_umap'
layers: 'counts'
embryonic_age
e09 20821
e10 29409
e11 22702
e12 10031
e13 12868
e14 4158
Name: count, dtype: int64
!pip install --user scikit-misc
Requirement already satisfied: scikit-misc in /Users/sohumpulijal/.local/lib/python3.11/site-packages (0.5.2) Requirement already satisfied: numpy>=1.26.4 in /opt/miniconda3/envs/project_env/lib/python3.11/site-packages (from scikit-misc) (2.3.5)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
adata.raw = adata
sc.pp.highly_variable_genes(adata, n_top_genes=2000, flavor="seurat_v3")
adata = adata[:, adata.var["highly_variable"]]
sc.pp.scale(adata, max_value=10)
adata
/opt/miniconda3/envs/project_env/lib/python3.11/site-packages/legacy_api_wrap/__init__.py:88: UserWarning: `flavor='seurat_v3'` expects raw count data, but non-integers were found. return fn(*args_all, **kw) /opt/miniconda3/envs/project_env/lib/python3.11/site-packages/scanpy/preprocessing/_scale.py:309: UserWarning: Received a view of an AnnData. Making a copy. view_to_actual(adata) /opt/miniconda3/envs/project_env/lib/python3.11/functools.py:909: UserWarning: zero-centering a sparse array/matrix densifies it. return dispatch(args[0].__class__)(*args, **kw)
AnnData object with n_obs Ć n_vars = 99989 Ć 2000
obs: 'age', '10X', 'sample_type', 'Sequencer', 'Flowcell', 'Collection', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_20_genes', 'total_counts_mt', 'log1p_total_counts_mt', 'pct_counts_mt', 'n_genes', 'batch', 'leiden', 'doublets', 'predicted_labels', 'majority_voting', 'conf_score', 'celltypes', 'over_clustering', 'batch_id', 'embryonic_age'
var: 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm', 'mean', 'std'
uns: 'log1p', 'hvg'
obsm: 'X_pca', 'X_scanorama', 'X_umap'
layers: 'counts'
sc.set_figure_params(dpi=100)
sc.tl.pca(adata, svd_solver="arpack")
sc.pp.neighbors(adata, n_neighbors=15, n_pcs=40)
sc.tl.umap(adata)
sc.pl.pca(adata, color="embryonic_age")
sc.pl.umap(adata, color="embryonic_age")
from sklearn.metrics import silhouette_score
embedding = adata.obsm["X_umap"]
labels = adata.obs["embryonic_age"].cat.codes.values
silhouette = silhouette_score(embedding, labels)
print("Silhouette score for embryonic age on UMAP embedding:", silhouette)
Silhouette score for embryonic age on UMAP embedding: -0.07712282240390778
sc.tl.rank_genes_groups(
adata,
groupby="embryonic_age",
method="wilcoxon",
n_genes=50
)
sc.pl.rank_genes_groups(adata, n_genes=20, sharey=False)
sc.pl.rank_genes_groups_heatmap(
adata,
n_genes=10,
groupby="embryonic_age",
show_gene_labels=True,
swap_axes=True
)
sc.pl.rank_genes_groups_dotplot(
adata,
n_genes=10,
groupby="embryonic_age"
)
WARNING: dendrogram data not found (using key=dendrogram_embryonic_age). Running `sc.tl.dendrogram` with default parameters. For fine tuning it is recommended to run `sc.tl.dendrogram` independently.
result = adata.uns["rank_genes_groups"]
groups = result["names"].dtype.names
top_genes = []
for g in groups:
top_genes.extend(list(result["names"][g][:3]))
top_genes = list(dict.fromkeys(top_genes))
sc.pl.umap(
adata,
color=top_genes,
ncols=4
)
import numpy as np
from sklearn.neighbors import NearestNeighbors
X = adata.obsm["X_umap"]
ages = adata.obs["embryonic_age"].astype(str).values
states = np.unique(ages)
n_states = len(states)
state_to_idx = {s: i for i, s in enumerate(states)}
idx_to_state = {i: s for s, i in state_to_idx.items()}
k = 25
nbrs = NearestNeighbors(n_neighbors=k + 1, algorithm="auto").fit(X)
distances, neighbor_indices = nbrs.kneighbors(X)
neighbor_indices = neighbor_indices[:, 1:]
T_counts = np.zeros((n_states, n_states), dtype=float)
for i, neighs in enumerate(neighbor_indices):
from_age = ages[i]
from_idx = state_to_idx[from_age]
for j in neighs:
to_age = ages[j]
to_idx = state_to_idx[to_age]
T_counts[from_idx, to_idx] += 1
T = T_counts / T_counts.sum(axis=1, keepdims=True)
print("States in order:", states)
print("Transition matrix (rows = from, columns = to):")
np.set_printoptions(precision=3, suppress=True)
print(T)
States in order: ['e09' 'e10' 'e11' 'e12' 'e13' 'e14'] Transition matrix (rows = from, columns = to): [[0.84 0.119 0.027 0.007 0.005 0.001] [0.083 0.823 0.071 0.014 0.007 0.002] [0.025 0.098 0.752 0.102 0.015 0.008] [0.015 0.046 0.231 0.637 0.049 0.022] [0.009 0.017 0.028 0.036 0.794 0.115] [0.009 0.015 0.047 0.049 0.356 0.523]]
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(6,5))
sns.heatmap(
T,
annot=True,
fmt=".2f",
xticklabels=states,
yticklabels=states
)
plt.xlabel("Next age")
plt.ylabel("Current age")
plt.title("Transition Probabilites through Embroynic Age ")
plt.tight_layout()
plt.show()
rng = np.random.default_rng(0)
def simulate_markov_chain(T, states, start_state, n_steps=10):
seq = [start_state]
current = start_state
for _ in range(n_steps):
i = state_to_idx[current]
probs = T[i]
next_idx = rng.choice(len(states), p=probs)
current = states[next_idx]
seq.append(current)
return seq
path = simulate_markov_chain(T, states, start_state="e09", n_steps=40)
print("Simulated developmental path:", " -> ".join(path))
Simulated developmental path: e09 -> e09 -> e09 -> e09 -> e09 -> e09 -> e10 -> e10 -> e10 -> e10 -> e11 -> e11 -> e09 -> e10 -> e09 -> e09 -> e09 -> e10 -> e10 -> e10 -> e10 -> e09 -> e09 -> e09 -> e09 -> e09 -> e09 -> e13 -> e14 -> e14 -> e14 -> e14 -> e13 -> e13 -> e13 -> e13 -> e13 -> e13 -> e14 -> e14 -> e13